import collections
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
from plotly.offline import plot
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
The data was acquired from Kaggle's open datasets. It is a raw dataset named "indian_food", representing the entirety of the data for August 5th to October 5th, 2020.
Independent variables – the measured attributes of an Indian dish: name, ingredients, prep_time, cook_time, flavor_profile, course, state, region.
Dependent variable – the classification of the dish by diet (vegetarian or non-vegetarian).
df = pd.read_csv('indian_food.csv')
df.head()
df.shape
df.info()
It can be seen that only cook_time and prep_time are numeric continuous variables. The others are categorical variables, which need to be one-hot encoded before model building.
df.isnull().sum()
fig_flavorprofile = sns.countplot(data=df, x="flavor_profile", order = df['flavor_profile'].value_counts().index)
fig_flavorprofile.set_title("flavor_profile countplot")
The diagram above shows the distribution of dishes by flavor profile; as the counts show, the distribution is imbalanced. The diet variable is imbalanced as well: the clean dataset contains 226 vegetarian versus 29 non-vegetarian dishes. This is later handled with SMOTE so that the model does not overfit to the vegetarian data points; the other solution would be to collect more data.
pie_chart = df.diet.value_counts().reset_index()
pie_chart.columns = ['diet','count']
fig = px.pie(pie_chart, values='count', names='diet', title='Vegetarian and Non-Vegetarian dishes Ratio')
fig.show()
ingredients = []
for i in range(len(df)):
    single_dish_ingredients = df["ingredients"][i]
    ingredients = ingredients + [word.lower() for word in nltk.word_tokenize(single_dish_ingredients) if word not in ['.', ',']]
print(ingredients)
word_freq = collections.Counter(ingredients)
W = WordCloud(background_color="white").fit_words(word_freq)
plt.figure(figsize = (10, 10), facecolor = None)
plt.imshow(W)
plt.axis('off')
plt.show()
text = ' '.join(ingredients)
india_coloring = np.array(Image.open('ind.jpg'))
wc = WordCloud(background_color="white", width = 400, height = 400, mask=india_coloring, min_font_size=8)
wc.generate(text)
image_colors = ImageColorGenerator(india_coloring)
plt.figure(figsize = (20, 20))
plt.imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis('off')
plt.show()
words = np.array(list(word_freq.keys()))
print(words)
def create_ingredientsVector(ingredients):
    ingredients_vec = np.zeros(words.shape)
    ingredients = set([word.lower() for word in nltk.word_tokenize(ingredients) if word not in ['.', ',']])
    for ingredient in ingredients:
        idx = np.where(words == ingredient)
        ingredients_vec[idx] = 1
    return ingredients_vec.tolist()
df["ingredients_vec"] = df["ingredients"].map(create_ingredientsVector)
df.head()
The ingredients need to be tokenized for further analysis. This is the most important feature of the model: the diet is predicted mainly from the ingredients. A binary vector is created for each dish, similar to one-hot (multi-hot) encoding, so that the algorithm can process the contents of the dish during prediction. The resulting matrix has shape (255, 337), i.e. (number of dishes, number of distinct ingredients).
ingredients_vecs = []
for i in range(len(df)):
    ingredients_vecs.append(df["ingredients_vec"][i])
ingredients_vecs = np.array(ingredients_vecs)
print(ingredients_vecs.shape)
from sklearn.metrics.pairwise import cosine_similarity
cos_simi_matrix = cosine_similarity(ingredients_vecs, ingredients_vecs)
plt.figure(figsize=(20, 20))
fig = sns.heatmap(cos_simi_matrix, cmap="Spectral")
fig.set_title("Cosine Similarity of Ingredient Vectors")
Correlation – a correlation heatmap was used to check the correlation between the features, and no highly correlated features were indicated: all pairwise correlations are below 0.8. In the heatmap, the ingredient vectors at indices 0–66 of the data frame have high cosine similarity with each other. Cosine similarity is used to measure how alike two ingredient vectors are: if the cosine similarity between two dishes is high, the dishes can be inferred to be similar.
df[df['name'].isin(['Kheer', 'Phirni', 'Rabri'])]
Ingredient vectors are used to check the cosine similarity between two dishes. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two vectors are far apart by Euclidean distance, they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
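As a worked example of this formula, consider two toy binary ingredient vectors over a hypothetical five-ingredient vocabulary (made up for illustration; the notebook's real vectors have 337 slots):

```python
import numpy as np

# Toy 5-slot binary ingredient vectors (illustrative only; the real vectors
# in this notebook have one slot per ingredient in the corpus).
dish_a = np.array([1, 1, 1, 0, 0])
dish_b = np.array([1, 1, 0, 1, 0])

# cos(theta) = a.b / (||a|| * ||b||)
cos_sim = dish_a @ dish_b / (np.linalg.norm(dish_a) * np.linalg.norm(dish_b))
print(round(cos_sim, 3))  # 0.667: two of the three ingredients are shared
```

Each dish has three ingredients and the two dishes share two, giving 2 / (√3 · √3) = 2/3.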
cosine_similarity([ingredients_vecs[9]], [ingredients_vecs[14]])
cosine_similarity([ingredients_vecs[14]], [ingredients_vecs[15]])
As seen above, the dishes at indices 9, 14, and 15 are sweet dishes with very similar ingredients. Therefore the cosine similarity between their ingredient vectors is high, indicating genuine closeness between the dishes.
df.iloc[30]['name']
cosine_similarity([ingredients_vecs[9]], [ingredients_vecs[30]])
As seen above, the dish at index 30 is a savoury dish being compared with a sweet dish. Therefore the cosine similarity between them is small, indicating no real closeness between the contents of the dishes.
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=100)
    kmeans.fit(ingredients_vecs)
    wcss.append(kmeans.inertia_)
#Plot Elbow Method
plt.plot(range(1, 10), wcss,marker='o')
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
Although this is a supervised learning problem, we would like to see how many categories the dishes would fall into if the dataset were unlabelled. For this we use the K-means algorithm: a clustering algorithm that searches for a pre-determined number of clusters within an unlabelled multidimensional dataset.
#Create Silhouette Coefficients
from sklearn.metrics import silhouette_score
for n_cluster in range(2, 10):
    kmeans = KMeans(n_clusters=n_cluster).fit(ingredients_vecs)
    label = kmeans.labels_
    sil_coeff = silhouette_score(ingredients_vecs, label, metric='euclidean')
    print('For n_clusters= {}, The Silhouette Coefficient is {}'.format(n_cluster, sil_coeff))
The silhouette score is used to evaluate the quality of clusters created by clustering algorithms such as K-means, in terms of how well samples are grouped with other samples that are similar to them. The score is computed for each candidate number of clusters; wherever there is an abrupt change in the sequence of scores is the point of the optimal number of clusters.
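The selection rule can be sketched on synthetic data (three artificial, well-separated blobs stand in for ingredients_vecs here; the centers are invented purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three clearly separated blobs.
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=100)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=100).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels, metric='euclidean')

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)  # 3, matching the number of generated blobs
```

With clean synthetic blobs the maximum is unambiguous; on real data like the ingredient vectors, the scores are flatter and the "abrupt change" heuristic above is the practical guide.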
food_vocab = set()
for ingredients in df['ingredients']:
    for food in ingredients.split(','):
        if food.strip().lower() not in food_vocab:
            food_vocab.add(food.strip().lower())
len(food_vocab)
print(food_vocab)
The ingredients in each dish are extracted and a vector dataframe is created from them. Each row is a vector that represents the ingredients in the dish.
food_columns = pd.DataFrame()
for i, ingredients in enumerate(df['ingredients']):
    for food in ingredients.split(','):
        if food.strip().lower() in food_vocab:
            food_columns.loc[i, food.strip().lower()] = 1
food_columns = food_columns.fillna(0)
food_columns
data = pd.read_csv('indian_food.csv')
data = data.drop(['name', 'ingredients'], axis=1)
The name column is removed because it provides no information to the model. The ingredients feature is removed, and the vector equivalent of the ingredients is used instead.
{column: list(data[column].unique()) for column in data.columns if data.dtypes[column] == 'object'}
Unique values are inspected for all the categorical variables in the dataset. It can be seen that there are "-1" values in the dataset that need to be handled.
data[['flavor_profile', 'state', 'region']] = data[['flavor_profile', 'state', 'region']].replace('-1', np.NaN)
The "-1" values in the dataset are replaced with NaN. For the continuous numerical features, these are later replaced with the mean of the corresponding feature.
def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df
data = onehot_encode(
data,
['flavor_profile', 'course', 'state', 'region'],
['f', 'c', 's', 'r']
)
'flavor_profile', 'course', 'state' and 'region' are one-hot encoded for model building.
data
data[['prep_time', 'cook_time']] = data[['prep_time', 'cook_time']].replace(-1, np.NaN)
data['prep_time'] = data['prep_time'].fillna(data['prep_time'].mean())
data['cook_time'] = data['cook_time'].fillna(data['cook_time'].mean())
label_encoder = LabelEncoder()
data['diet'] = label_encoder.fit_transform(data['diet'])
{index: label for index, label in enumerate(label_encoder.classes_)}
y = data['diet']
X = data.drop('diet', axis=1)
X_food = pd.concat([X, food_columns], axis=1)
food_columns.shape
sc = StandardScaler()
X = sc.fit_transform(X)
X_food = sc.fit_transform(X_food)
StandardScaler is used to standardize the values in the dataset, rescaling each feature to zero mean and unit variance.
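A quick sanity check of what StandardScaler produces, on toy numbers rather than the notebook's features: after transformation, each column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy two-feature matrix with very different scales per column.
toy = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [30.0, 600.0]])

scaled = StandardScaler().fit_transform(toy)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```

This per-column standardization keeps large-valued features (like cook_time in minutes) from dominating the small binary ingredient columns.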
X_food.shape
from imblearn.over_sampling import SMOTE
smt=SMOTE(random_state=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train_smt,y_train_smt = smt.fit_resample(X_train,y_train)
X_food_train, X_food_test, y_food_train, y_food_test = train_test_split(X_food, y, train_size=0.7, random_state=42)
X_food_train_smt,y_food_train_smt = smt.fit_resample(X_food_train,y_food_train)
Train-test split is used to create two sets of data: the first without the ingredient vector dataframe attached, and the second with the ingredient vector dataframe appended to the other normalized and encoded features. SMOTE is used to compensate for the under-represented class's data points; after SMOTE there are 156 data points for each class.
print('Train Data - Class Split - Without Ingredient Vectors')
Outcome_0= (y_train_smt == 0).sum()
Outcome_1 = (y_train_smt == 1).sum()
print('Class 0 (Non Vegetarian)-', Outcome_0)
print('Class 1 (Vegetarian)-', Outcome_1)
print('\n')
print('Train Data - Class Split - With Ingredient Vectors')
Outcome_0= (y_food_train_smt == 0).sum()
Outcome_1 = (y_food_train_smt == 1).sum()
print('Class 0 (Non Vegetarian)-', Outcome_0)
print('Class 1 (Vegetarian)-', Outcome_1)
import tensorflow
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
import tensorflow.keras.metrics
Neural networks are a set of algorithms, modeled loosely on the human brain, that are designed to recognize patterns; they help us cluster and classify. In a neural network, a single perceptron (neuron) can be viewed as a logistic regression. An Artificial Neural Network (ANN) is a group of multiple perceptrons/neurons at each layer. An ANN is also known as a feed-forward neural network because inputs are processed only in the forward direction.
It consists of 3 layers – Input, Hidden and Output. The input layer accepts the inputs, the hidden layer processes the inputs, and the output layer produces the result. Essentially, each layer tries to learn certain weights.
def build_model(num_features, hidden_layer_sizes=(64, 64)):
    model = Sequential()
    model.add(InputLayer(input_shape=(num_features, )))
    model.add(Dense(hidden_layer_sizes[0], activation='relu'))
    model.add(Dense(hidden_layer_sizes[1], activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', tensorflow.keras.metrics.AUC(name='auc')])
    model.summary()
    return model
The first model is a neural network trained on the dataset without the ingredient vectors appended. It has 4 layers: input, output, and 2 hidden layers. The input layer's size equals the number of features in the dataset, in this case 40. The batch size is 64 and the number of epochs is 41.
X.shape
model = build_model(40)
batch_size = 64
epochs = 41
history = model.fit(
    X_train_smt,
    y_train_smt,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs
)
plt.figure(figsize=(20, 10))
epochs_range = range(1, epochs + 1)
train_loss, val_loss = history.history['loss'], history.history['val_loss']
train_auc, val_auc = history.history['auc'], history.history['val_auc']
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_loss, label="Training Loss")
plt.plot(epochs_range, val_loss, label="Validation Loss")
plt.title("Loss")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_auc, label="Training AUC")
plt.plot(epochs_range, val_auc, label="Validation AUC")
plt.title("AUC")
plt.legend()
plt.show()
The performance of the model is plotted.
In the first plot the validation loss decreases along with the training loss, with a few fluctuations toward the end of the run. This is about as far as the validation loss can be reduced: as shown below, the epoch index with the minimum validation loss is 40.
The second plot shows the training AUC versus the validation AUC. The training AUC increases initially and then plateaus, while the validation AUC stays roughly constant, indicating that the TPR–FPR trade-off (True Positive Rate vs. False Positive Rate) is stable. The epoch index with the maximum validation AUC is 0.
print(np.argmin(val_loss), np.argmax(val_auc))
model.evaluate(X_test, y_test)
The average loss, accuracy and AUC are 0.486, 0.857 and 0.682 respectively. The model has good accuracy but only a moderate AUC.
from sklearn.metrics import classification_report, confusion_matrix
# predict probabilities for test set
yhat_probs = model.predict(X_test, verbose=0)
# predict crisp classes for test set (predict_classes was removed in recent
# Keras versions; thresholding the sigmoid probabilities at 0.5 is equivalent)
yhat_classes = (yhat_probs > 0.5).astype("int32")
# confusion matrix
print('\nConfusion Matrix and Classification Report\n')
matrix = confusion_matrix(y_test, yhat_classes)
print(matrix)
target_names=['Class 0 - non-vegetarian','Class 1 - vegetarian']
print(classification_report(y_test, yhat_classes, target_names=target_names))
For the neural network model evaluation without the ingredient vectors, the Confusion Matrix and Classification Report for the standard model give the following results:
• For Class 0 (Non-Vegetarian), 2 dishes were identified correctly and 6 incorrectly.
• For Class 1 (Vegetarian), 64 dishes were identified correctly and 5 incorrectly.
• True Positives – correctly predicted positive values: 2.
• True Negatives – correctly predicted negative values: 64.
• False Positives – negative values incorrectly predicted as positive: 6.
• False Negatives – positive values incorrectly predicted as negative: 5.
The model has an overall precision of 87% and an overall accuracy of 86%.
The precision for Class 0 (Non-Vegetarian) is only 25% due to the large number of false positives.
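These class-0 figures can be re-derived by hand from the counts quoted above, treating non-vegetarian as the positive class:

```python
# Counts from the confusion matrix above, with non-vegetarian as "positive".
tp, fp, fn, tn = 2, 6, 5, 64

precision = tp / (tp + fp)                    # 2 / 8  = 0.25
recall = tp / (tp + fn)                       # 2 / 7  ~ 0.286
accuracy = (tp + tn) / (tp + fp + fn + tn)    # 66 / 77 ~ 0.857
print(precision, round(recall, 3), round(accuracy, 3))
```

The 0.25 precision confirms that most dishes flagged as non-vegetarian are actually vegetarian, while the 0.857 accuracy matches the evaluate() output: with so few non-vegetarian samples, overall accuracy hides the weak minority-class performance.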
X_food.shape
The second model is a neural network trained on the dataset with the ingredient vectors appended. It has 4 layers: input, output, and 2 hidden layers. The input layer's size equals the number of features in the dataset, which includes the food vectors dataframe appended to the transformed dataframe, in this case 405. The batch size is 64 and the number of epochs is 200.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
food_model = build_model(X_food.shape[1], hidden_layer_sizes=(128, 128))
food_batch_size = 64
food_epochs = 200
food_history = food_model.fit(
    X_food_train_smt,
    y_food_train_smt,
    validation_split=0.2,
    batch_size=food_batch_size,
    epochs=food_epochs
)
plt.figure(figsize=(20, 10))
food_epochs_range = range(1, food_epochs + 1)
food_train_loss, food_val_loss = food_history.history['loss'], food_history.history['val_loss']
food_train_auc, food_val_auc = food_history.history['auc'], food_history.history['val_auc']
plt.subplot(1, 2, 1)
plt.plot(food_epochs_range, food_train_loss, label="Training Loss")
plt.plot(food_epochs_range, food_val_loss, label="Validation Loss")
plt.title("Loss")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(food_epochs_range, food_train_auc, label="Training AUC")
plt.plot(food_epochs_range, food_val_auc, label="Validation AUC")
plt.title("AUC")
plt.legend()
plt.show()
The performance of the model is plotted.
In the first plot the validation loss decreases along with the training loss; the two curves lie almost on top of each other, which shows the model is training well. As shown below, the epoch index with the minimum validation loss is 199, so it cannot be reduced further within this run.
The second plot shows the training AUC versus the validation AUC. The training AUC increases initially and then plateaus, while the validation AUC stays roughly constant, indicating that the TPR–FPR trade-off (True Positive Rate vs. False Positive Rate) is stable. The epoch index with the maximum validation AUC is 0.
print(np.argmin(food_val_loss), np.argmax(food_val_auc))
food_model.evaluate(X_food_test, y_food_test)
The average loss, accuracy and AUC are 0.267, 0.935 and 0.89 respectively. The model has good accuracy and a good AUC.
# predict probabilities for test set
yhat_probs = food_model.predict(X_food_test, verbose=0)
# predict crisp classes for test set (predict_classes was removed in recent
# Keras versions; thresholding the sigmoid probabilities at 0.5 is equivalent)
yhat_classes = (yhat_probs > 0.5).astype("int32")
# confusion matrix
print('\nConfusion Matrix and Classification Report\n')
matrix = confusion_matrix(y_food_test, yhat_classes)
print(matrix)
target_names=['Class 0 - non-vegetarian','Class 1 - vegetarian']
print(classification_report(y_food_test, yhat_classes, target_names=target_names))
The neural network model evaluation with the ingredient vectors is shown below. The Confusion Matrix and Classification Report for the standard model give the following results:
• For Class 0 (Non-Vegetarian), 3 dishes were identified correctly and 8 incorrectly.
• For Class 1 (Vegetarian), 62 dishes were identified correctly and 4 incorrectly.
• True Positives – correctly predicted positive values: 3.
• True Negatives – correctly predicted negative values: 62.
• False Positives – negative values incorrectly predicted as positive: 8.
• False Negatives – positive values incorrectly predicted as negative: 4.
The model has an overall precision of 88% and an overall accuracy of 84%.
The precision for Class 0 (Non-Vegetarian) is 27% due to the large number of false positives.
Logistic regression is a classification algorithm used when the dependent variable is binary. Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval independent variables.
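The core of the method described above is the sigmoid function, which maps a linear score to a probability. A minimal illustration (not the sklearn implementation used below) looks like:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score z = w.x + b to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                 # 0.5, the decision boundary
print(sigmoid(4.0), sigmoid(-4.0))  # pushed toward 1 and 0 respectively
```

A dish is predicted as class 1 when the sigmoid output is at least 0.5, i.e. when the linear score is non-negative.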
There are 405 features in the model which might affect the model due to high dimensionality. Therefore, for dimensionality reduction PCA(Principal Component analysis) is conducted. Principal component analysis computes a new set of variables (principal components) and expresses the data in terms of these new variables.
from sklearn.decomposition import PCA
pca_none = PCA(n_components=None, random_state=100)
pca_none.fit(X_food)
pca_var_ratios = pca_none.explained_variance_ratio_
# Create a function
def select_n_components(var_ratio, goal_var: float) -> int:
    total_variance = 0.0
    n_components = 0
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        if total_variance >= goal_var:
            break
    return n_components
n_comppca=select_n_components(pca_var_ratios, 0.95)
print(n_comppca)
pca = PCA(n_components=n_comppca,svd_solver='full')
transformed_data = pca.fit_transform(X_food)
x_train_log,x_test_log,y_train_log,y_test_log = train_test_split(X_food, y, test_size = 0.2, random_state = 100)
x_train_log_smt,y_train_log_smt = smt.fit_resample(x_train_log,y_train_log)
Train-test split is used to create the training set. The dataset used has the ingredient vector dataframe appended to the other normalized and encoded features. SMOTE is used to compensate for the under-represented class's data points; after SMOTE there are 182 data points for each class.
print('Train Data - Class Split')
Outcome_0= (y_train_log_smt == 0).sum()
Outcome_1 = (y_train_log_smt == 1).sum()
print('Class 0 (Non Vegetarian)-', Outcome_0)
print('Class 1 (Vegetarian)-', Outcome_1)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
logreg = LogisticRegression(solver='liblinear', class_weight='balanced', random_state=100)
logreg.fit(x_train_log_smt,y_train_log_smt)
y_pred = logreg.predict(x_test_log)
target_names=['Class 0 - non-vegetarian','Class 1 - vegetarian']
print('\nNumber of PCA components:',n_comppca)
print('\nConfusion Matrix and Classification Report')
print('\n', confusion_matrix(y_test_log,y_pred))
print('\n',classification_report(y_test_log,y_pred,target_names=target_names))
print('\n')
The Confusion Matrix and Classification Report for the standard model give the following results:
• For Class 0 (Non-Vegetarian), 7 dishes were identified correctly and 17 incorrectly.
• For Class 1 (Vegetarian), 27 dishes were identified correctly and 0 incorrectly.
• True Positives – correctly predicted positive values: 7.
• True Negatives – correctly predicted negative values: 27.
• False Positives – negative values incorrectly predicted as positive: 17.
• False Negatives – positive values incorrectly predicted as negative: 0.
The model has an overall precision of 90% and an overall accuracy of 67%.
The precision for Class 0 (Non-Vegetarian) is 29% due to the large number of false positives.
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, threshold = roc_curve(y_test_log, y_pred)
auc_score = roc_auc_score(y_test_log, y_pred)
print('ROC Curve')
#Plot the ROC Curve
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % auc_score)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems: a probability curve that plots the TPR against the FPR at various threshold values. The AUC score of the model is 0.81, which is a good score; it means the model ranks a randomly chosen positive example above a randomly chosen negative one about 81% of the time.
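One caveat about the cell above: it passes the hard 0/1 predictions y_pred to roc_curve and roc_auc_score, which collapses the ROC to a single operating point. A sketch on synthetic stand-in data (not the notebook's features) shows how feeding predicted probabilities from predict_proba instead keeps the full threshold sweep:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary-classification data stands in for the food features here.
X_demo, y_demo = make_classification(n_samples=200, random_state=100)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

auc_hard = roc_auc_score(y_demo, clf.predict(X_demo))              # 0/1 labels
auc_soft = roc_auc_score(y_demo, clf.predict_proba(X_demo)[:, 1])  # scores
print(auc_hard, auc_soft)
```

With hard labels the "curve" is just two line segments through one point; with probability scores the AUC reflects every possible threshold, which is usually the more informative evaluation.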
The dataset used has some constraints: there are only 255 records, with an imbalance in the number of samples per class. More balanced data could better predict and answer the problem statement. The model classifies vegetarian dishes with greater accuracy because of this bias.
The dataset was clean and did not require synthetic values to be added, which is a good sign.
There exists high multi-collinearity among the food vectors, as evident from the heatmap. This showed that some dishes are highly similar to other dishes.
All the features are kept for the model analysis, and overall dimensionality reduction is conducted, which produces good results for logistic regression. For the neural network model the input layer shape is changed according to the data dimensions.
A recall of 100% for Class 1 (logistic regression) and 95% for Class 1 (neural network) shows that the models work well for the study under consideration.
According to the problem statement, the model classifies vegetarian dishes well but is not accurate enough for non-vegetarian dishes because of the high class imbalance.
import pickle
import joblib
capstone_1002_lg_model = 'capstone_1002_lg_model.pkl'
capstone_1002_lg_model_joblib = 'capstone_1002_lg_model.sav'
pickle.dump(logreg, open(capstone_1002_lg_model, 'wb'))
joblib.dump(logreg, capstone_1002_lg_model_joblib)
food_model.save("capstone_1002_tf_model.h5")